A "monkey book" is a book consisting of a random distribution of letters and blanks, where a
group of letters surrounded by two blanks is defined as a word. We compare the statistics of the
word distribution for a monkey book with the corresponding distribution for the general class of
random books, where the latter are books for which the words are randomly distributed. It is
shown that the word distribution statistics for the monkey book are quite distinct from those of
a typical sampled book or real book. In particular, the monkey book obeys Heaps' power law to an
extraordinarily good approximation, in contrast to the word distributions for sampled and real books,
which deviate from Heaps' law in a characteristic way. The somewhat counter-intuitive conclusion
is that a "monkey book" obeys Heaps' power law precisely because its word-frequency distribution
is not a smooth power law, contrary to the expectation based on simple mathematical arguments
that if one is a power law, so is the other.




A. Introduction 

Words in a book occur with different frequencies. Common words like "the" occur very
frequently and constitute about 5% of the total number of written words in the book, whereas
about half the different words only occur a single time [1]. The word-frequency $N(k)$ is defined
as the number of words which occur $k$ times. The corresponding word-frequency distribution (wfd)
is defined as $P(k) = N(k)/N$, where $N$ is the total number of different words. Such a distribution
is typically broad and is often called "fat-tailed" and "power-law-like". "Power-law-like" means
that the large-$k$ tail of the distribution to a reasonable approximation follows a power law, so
that $P(k) \propto 1/k^{\gamma}$. Typically, one finds that $\gamma < 2$ for a real book [2-4]. What does
this broad frequency distribution imply? Does it have something to do with how the book is
actually written? Or does it have something to do with the evolution of the language itself? The
fact that the wfd has a particular form was first associated with the empirical Zipf law for the
corresponding word-rank distribution [5-7]. Zipf's law corresponds to $\gamma = 2$. Subsequently
Herbert Simon proposed that the particular form of the wfd could be associated with a growth
model, the Simon model, where the distribution of words was related to a particular stochastic
way of writing a text from the beginning to the end [5]. However, closer scrutiny of the Simon
model reveals that the statistical properties implied by this model are fundamentally different
from what is found in any real text [3]. Mandelbrot (at about the same time as Simon suggested
his growth model) instead proposed that the language itself had evolved so as to optimize an
information measure based on an estimated word cost (the more letters needed to build up a word,
the higher the cost of the word) [9, 10]. Thus in this case the power law of the word distribution
was proposed to be a reflection of an evolved property of the language itself. However, it was
later pointed out by Miller in Ref. [11] that one does not need any particular language-evolution
optimization to obtain a power law: a monkey randomly typing letters and blanks on a typewriter
will also produce a wfd which is power-law-like within a continuum approximation. The monkey
book hence, at least superficially, has properties in common with real books [12-16]. That the
relation to real books is merely superficial has in particular been argued in Refs. [13] and [15].

In 1978, Harold Stanley Heaps [17] presented another empirical law describing the relation
between the number of different words, $N$, and the total number of words, $M$. Heaps' power law
states that $N(M) \propto M^{\alpha}$, where $\alpha$ is a constant between zero and one. However, it
was recently shown that Heaps' law gives an inadequate description of this relation for real
books, and that it needs to be modified so that the exponent $\alpha$ changes with the size of the
book from $\alpha = 1$ for $M = 1$ to $\alpha = 0$ as $M \to \infty$ [2]. It was also shown that the wfd
of real books, in general, can be better described by introducing an exponential cutoff so that
$P(k) = A \exp(-bk) k^{-\gamma}$ [3]. A simple mathematical derivation of the relation between the
power-law exponents $\gamma$ and $\alpha$ gives the result $\alpha = \gamma - 1$ [2]. This in turn means
that the shape of the wfd also changes with the size of the book, so that $\gamma = 2$ for small $M$,
but reaches the limit value $\gamma = 1$ as $M$ goes to infinity. The same analysis showed that the
parameter $b$ is size dependent according to $b \approx b_0/M$ [2]. It was also shown empirically
that the works of a single author follow the same $N(M)$-curve to a good approximation, which was
further manifested in the meta-book concept: the $N(M)$-curve characterizing a text of an
individual author is obtainable by pulling sections from the author's collective meta book [5].
As will be further discussed below, the shape of the $N(M)$-curve is mathematically closely
related to the Random Book Transformation (RBT) [2][3].
As mentioned above, the writing of a real book cannot be described by a growth model because
the statistical properties of a real book are translationally invariant [3]. The monkey book, on
the other hand, is produced by a translationally invariant stationary process. An important and
much-discussed question is then how close the statistical properties of the monkey book really
are to those of a real book. It is shown in the present work that in the context of Heaps' law
the answer is somewhat paradoxical.



B. Monkey book 

Imagine an alphabet with $A$ letters and a typewriter with a keyboard with one key for each
letter and a space bar. For a monkey randomly typing on the typewriter, the chance of hitting
the space bar is assumed to be $q_s$ and the chance of hitting any given letter is $(1-q_s)/A$.
A word is then defined as a sequence of letters surrounded by blanks. What is the resulting wfd
for a text containing $M$ words? Miller in Ref. [11] found that in the continuum limit this is in
fact a power law. In the appendix we re-derive this result using an information cost method. A
more standard alternative derivation can be found in Ref. 

®- 

We will denote the word-frequency distribution in the continuum limit by $p(k)$; in the monkey
book case it is given by

$$p(k) \propto k^{-\gamma} \qquad (1)$$

with

$$\gamma = \frac{2\ln A - \ln(1-q_s)}{\ln A - \ln(1-q_s)}. \qquad (2)$$

Thus, if $q_s = 1/(A+1)$ then $\gamma = 1$ if $A = 1$ and $\gamma = 2$ in the limit of infinite $A$.
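As a quick numerical check of the limits quoted above, the following sketch evaluates the continuum exponent for a few alphabet sizes. It assumes the closed form $\gamma = (2\ln A - \ln(1-q_s))/(\ln A - \ln(1-q_s))$ obtained in the appendix; the function name `gamma_exponent` is ours and is introduced only for illustration.

```python
import math


def gamma_exponent(A: int, q_s: float) -> float:
    """Continuum power-law exponent of the monkey-book wfd (assumed appendix form)."""
    log_one_minus_q = math.log(1.0 - q_s)
    return (2.0 * math.log(A) - log_one_minus_q) / (math.log(A) - log_one_minus_q)


if __name__ == "__main__":
    for A in (1, 2, 4, 26, 10**6):
        q_s = 1.0 / (A + 1)  # space bar as likely as any single letter key
        print(A, round(gamma_exponent(A, q_s), 3))
```

With $q_s = 1/(A+1)$ the expression reduces to $\gamma = 1 + \ln A/\ln(A+1)$, which gives $\gamma = 1$ for $A = 1$, about 1.63 for $A = 2$, about 1.86 for $A = 4$, and approaches 2 as $A$ grows.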



1. Continuum approximation versus real word-frequency 



FIG. 1: Word-frequency distribution for the monkey book. (a) The broken straight line
corresponds to the continuum approximation $p(k) \propto k^{-\gamma}$ given by Eq. (2), whereas the
full curve with disjunct peaks represents the real distribution $P(k)$. $P(k)$ and its continuum
approximation $p(k)$ are clearly very different. (b) The corresponding cumulative distributions
$f(k)$ and $F(k)$. The broken straight line corresponds to $f(k) \propto k^{-(\gamma-1)}$ and the black
zig-zag line to the corresponding real cumulative distribution $F(k)$. Note that $f(k)$ is to a
good approximation an envelope of the black zig-zag $F(k)$. The gray zig-zag curve is the
cumulative $F(k)$ for a tenth of the monkey book. Note that $f(k)$ still gives an equally good
envelope. Thus the envelope of the cumulative $F(k)$ for a monkey book is a size-independent
power law.



The above result for $p(k)$ is an approximation of the actual (discrete) result expected from
random typing. The true wfd of the model will here be denoted as $P(k)$. What is then the
relation between the power-law form of $p(k)$ and the actual probability, $P(k)$, for a word to
occur $k$ times in the text? It is quite straightforward to let a computer take the place of a
monkey and simulate monkey books [12]. Fig. 1a gives an example for an alphabet with $A = 4$
letters, a total number of words $M = 10^6$, and a chance of hitting the space bar of
$q_s = 1/(A+1) = 1/5$. Such a book should have a power-law exponent of $\gamma \approx 1.86$
according to Eq. (2). Note that $P(k)$ for higher $k$ consists of disjunct peaks: the peak with
the highest $k$ corresponds to the $A = 4$ one-letter words, the next towards lower $k$ to the
$A^2 = 16$ two-letter words, and so forth. Thus the power-law tail $1/k^{\gamma}$ in the case of a
monkey book is not a smooth tail but a sequence of separated peaks, as previously reported in
Refs. [12-16]. So what is the relation to the continuum $p(k) \propto k^{-1.86}$? Plotted on log-log
scales as in Fig. 1a, $p(k)$ is just a straight line with slope $-\gamma = -1.86$ (broken line in
Fig. 1a). Represented in this way there is no obvious discernible relation between the separated
peaks of $P(k)$ and the straight line given by $p(k)$. In order to see the connection directly,
one can instead compare the cumulative distributions $F(k) = \sum_{k' \ge k} P(k')$ and
$f(k) = \sum_{k' \ge k} p(k') \propto 1/k^{0.86}$. In Fig. 1b, $F(k)$ corresponds to the full zig-zag
curve and the straight broken line with slope $-0.86$ to the continuum approximation $f(k)$. In
this plot the connection is more obvious: $f(k)$ is an envelope of $F(k)$. Figure 1b also
illustrates that the envelope slope for the monkey book is independent of the length of the
book: the full zig-zag curve corresponds to $M = 10^6$ whereas the dotted zig-zag curve
corresponds to $M = 10^5$. Both of them have the envelope slope $-(\gamma - 1) = -0.86$ given by the
continuum approximation $f(k)$.

To sum up: the continuum approximation $p(k) \propto 1/k^{\gamma}$ is very different from the actual
spiked wfd of the monkey book, $P(k)$. However, the envelope, $f(k)$, of the cumulative wfd,
$F(k)$, of the monkey book is nevertheless a power law with a slope which is independent of the
size of the book.
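The spiked structure described above is easy to reproduce numerically. The sketch below is a minimal simulation, not the authors' code: the function names are ours, and we assume that empty "words" produced by consecutive blanks are simply skipped. It groups the occurrence counts $k$ by word length, so that the separation between the one-letter, two-letter and three-letter peaks becomes visible.

```python
import random
from collections import Counter


def monkey_book(num_words, A, q_s, seed=0):
    """Return num_words random words typed over an A-letter alphabet (A <= 26).

    Assumption: empty 'words' produced by consecutive blanks are skipped.
    """
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"[:A]
    words = []
    while len(words) < num_words:
        chars = []
        while rng.random() >= q_s:  # a letter key was hit
            chars.append(rng.choice(letters))
        if chars:  # the space bar ended a non-empty word
            words.append("".join(chars))
    return words


def frequency_by_length(words):
    """Map word length -> sorted list of occurrence counts k for words of that length."""
    by_length = {}
    for word, k in Counter(words).items():
        by_length.setdefault(len(word), []).append(k)
    return {length: sorted(ks) for length, ks in sorted(by_length.items())}


if __name__ == "__main__":
    book = monkey_book(num_words=50_000, A=4, q_s=0.2)
    for length, ks in frequency_by_length(book).items():
        # word length, number of distinct words, smallest and largest count
        print(length, len(ks), min(ks), max(ks))
```

For short words the ranges of $k$ printed for successive lengths do not overlap, which is the origin of the disjunct peaks in $P(k)$; for long words the counts become small and the peaks merge.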



C. Heaps' law 

Heaps' law is an empirical law which states that the number of different words, $N$, in a book
approximately increases as $N(M) \propto M^{\alpha}$ as a function of the total number of words [17].
For a random book, like the monkey book, there is a direct connection between $P(k)$ and the
$N(M)$-curve. A random book means a book where the class of words which occur $k$ times is
randomly distributed throughout the book: the chance of finding a word with frequency $k$ is
independent of the position in the book, i.e., it is as likely to find a word with frequency $k$
at the beginning, in the middle, or at the end of the book. Suppose that such a book of size $M$
has a wfd $P_M(k)$ created by sampling a fixed theoretical probability distribution
$p(k) \propto k^{-\gamma}$, where the normalization constant is only weakly dependent on $M$. The
number of different words for a given size is then related to $M$ through the relation



$$M = N(M) \sum_{k=1}^{M} k\,p(k) \qquad (3)$$



and, since in the present case

$$\sum_{k=1}^{M} k\,p(k) \propto M^{2-\gamma}, \qquad (4)$$

it follows that

$$N(M) \propto M^{\gamma-1}. \qquad (5)$$
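The step leading to Eq. (5) can be made explicit by approximating the sum over $k$ by an integral. The following is a sketch for $1 < \gamma < 2$, writing $p(k) = c\,k^{-\gamma}$ with a normalization constant $c$ that is treated as independent of $M$ (an approximation):

```latex
\sum_{k=1}^{M} k\,p(k) \approx c\int_{1}^{M} k^{1-\gamma}\,dk
  = c\,\frac{M^{2-\gamma}-1}{2-\gamma} \propto M^{2-\gamma}
  \quad (M \gg 1),
\qquad\Rightarrow\qquad
N(M) = \frac{M}{\sum_{k=1}^{M} k\,p(k)} \propto M^{\gamma-1}.
```

In the borderline case $\gamma = 1$ the normalization can no longer be treated as constant: $c \approx 1/\ln M$, so that $\sum_k k\,p(k) \approx M/\ln M$ and $N(M) \approx \ln M$.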



A heuristic direct way to this result is to argue that the first time for a word with
frequency $k$ to occur is inversely proportional to its frequency, $\tau \propto 1/k$, so that




FIG. 2: Heaps' law for monkey books with different sizes of the alphabet, in log-log scale. The
full curves from top to bottom give $N(M)$ for alphabets of size $A = 4$, 2, and 1, respectively.
According to Eq. (5), $N(M)$ should for $A = 4$ and 2 follow Heaps' power laws with the exponents
0.86 and 0.63, respectively, and the corresponding broken lines show that these predictions are
borne out to excellent precision. For $A = 1$, Eq. (5) predicts that $N(M)$ should instead be
proportional to $\ln M$, since $\gamma - 1 = 0$. The corresponding broken curve again shows
excellent agreement.
t is propor-
you are, this means
The conclusion from Eqs. (3)-(5) is that the $N(M)$-curve of a random book with
$P_M(k) \propto k^{-\gamma}$ should follow Heaps' law very precisely with the power-law index
$\alpha = \gamma - 1$. One consequence of this is that if you start with such a book of size $M$
with $N(M)$ different words and then randomly pick half the words, then this new book of $M/2$
words will on average have $N(M/2) \propto (M/2)^{\alpha}$ different words, with
$\alpha = \gamma - 1$. Thus, starting from a random book of size $M$, you can obtain the complete
$N(M)$-curve by dividing the book into parts of smaller sizes. Furthermore, in the special case
where $P_M(k)$ is a power law with a functional form, and a power-law index, which are size
independent, the $N(M)$-curve follows Heaps' law very precisely with $\alpha = \gamma - 1$. Figure 2
illustrates that this is indeed true for monkey books by showing the $N(M)$-curve for different
alphabet sizes (full curves) together with the corresponding analytic solutions (broken curves).
Note that for Heaps' law, $N(M) \propto M^{\alpha}$, and the relationship $\alpha = \gamma - 1$ to hold,
the full curves should be parallel to the broken curves for each alphabet size. Also, the
continuum theory from Eq. (2) gives $\gamma = 1$ for $A = 1$ (an alphabet with a single letter),
which by Eq. (5) predicts $N \propto \ln M$, again in full agreement with the monkey book.
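The agreement between the monkey book and $\alpha = \gamma - 1$ can be checked with a small simulation. The sketch below is ours and makes two assumptions: empty words from consecutive blanks are skipped, and the exponent is estimated by an ordinary least-squares fit of $\ln N$ against $\ln M$ at a handful of book sizes. It is an illustration of the claim, not a reproduction of Fig. 2.

```python
import math
import random


def monkey_book(num_words, A, q_s, seed=0):
    """Random words over an A-letter alphabet (A <= 26); empty words are skipped."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"[:A]
    words = []
    while len(words) < num_words:
        chars = []
        while rng.random() >= q_s:  # a letter key was hit
            chars.append(rng.choice(letters))
        if chars:
            words.append("".join(chars))
    return words


def heaps_curve(words, checkpoints):
    """Return [(M, N(M))]: number of distinct words N among the first M words."""
    wanted = set(checkpoints)
    seen, curve = set(), []
    for m, word in enumerate(words, start=1):
        seen.add(word)
        if m in wanted:
            curve.append((m, len(seen)))
    return curve


def loglog_slope(curve):
    """Least-squares slope of ln N versus ln M."""
    xs = [math.log(m) for m, _ in curve]
    ys = [math.log(n) for _, n in curve]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    A, q_s = 4, 0.2
    book = monkey_book(100_000, A, q_s)
    curve = heaps_curve(book, [1000 * 2**j for j in range(7)])
    gamma = (2 * math.log(A) - math.log(1 - q_s)) / (math.log(A) - math.log(1 - q_s))
    print("fitted slope:", round(loglog_slope(curve), 3))
    print("gamma - 1:   ", round(gamma - 1, 3))
```

For $A = 4$ and $q_s = 1/5$ the fitted slope should come out close to $\gamma - 1 \approx 0.86$, up to statistical fluctuations of the single realization.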

However, notwithstanding this excellent agreement, the reasoning is nevertheless flawed by a
serious inconsistency: the connection to Heaps' law, $N(M) \propto M^{\alpha}$, was here established
for a random book with a continuous power-law wfd, whereas the wfd of a monkey book consists of
a series of disjunct peaks, and only the envelope of its cumulative wfd can be described by a
continuous power law. It thus seems reasonable that a random book with a wfd which is well
described by a smooth power law would satisfy Heaps' law to an even greater extent. However,
this reasoning is not correct. The derived form of Heaps' law, $N(M) \propto M^{\gamma-1}$, is based
on a wfd for which the functional form and $\gamma$ are size independent. But as we will show in
the following section, this is an impossibility: a random book with a continuous wfd cannot, in
principle, be described by a size-independent power law.



D. Contradicting power laws 

The most direct way to realize this inconsistency problem is to start from a random book which
has a smooth power-law wfd with an index $\gamma$. Such a book can be obtained by randomly
sampling word frequencies from a continuous power-law distribution of a given $\gamma$ and then
placing the words, separated by blanks, randomly on a line. For this "sampled book" one can then
directly obtain the $N(M)$-curve by dividing the book into parts, as described above. Fig. 3a
gives an example of an $N(M)$-curve for a sampled book with $\gamma = 1.86$, $N = 10^5$ and
$M = 10^6$. The resulting wfd is shown in Fig. 3b.

It is immediately clear from Fig. 3a that a sampled book with a power-law wfd does not have an
$N(M)$-curve which follows Heaps' law, $N(M) \propto M^{\alpha}$ (it deviates from the straight line
in the figure). This is in contrast to the result of the derivation given by Eqs. (3)-(5) and to
the monkey book, which does obey Heaps' law, $N(M) \propto M^{\gamma-1}$, as seen from Fig. 2. This
means that the monkey book obeys Heaps' law because the wfd is not well described by a smooth
power law, and that the "spiked" form of the monkey-book $P(k)$ is, in fact, crucial for the
result. The explanation for the size invariance of the monkey book can be found in the
derivation presented in the appendix. Since the frequency of each word is exponential in the
length of the word, it naturally introduces a discrete size-invariant property of the book. This
discreteness is responsible for the disjunct peaks shown in Fig. 1a, and it is easy to realize
that non-overlapping Gaussian peaks will transform into new Gaussian peaks with conserved
relative amplitudes, thus resulting in a size-independent envelope.
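The bending of the $N(M)$-curve for a sampled book can be illustrated with a short simulation. This is a sketch under stated assumptions rather than the procedure behind Fig. 3: frequencies are drawn as floored Pareto variables so that $P(K \ge k) = k^{-(\gamma-1)}$, a cap `k_max` (our choice, only to bound memory) truncates the tail, and the book is much smaller than the one in the figure. All function names are hypothetical.

```python
import math
import random


def sample_frequencies(num_distinct, gamma, k_max, rng):
    """Draw integer word frequencies with P(K >= k) = k**-(gamma - 1), capped at k_max."""
    freqs = []
    while len(freqs) < num_distinct:
        u = 1.0 - rng.random()  # uniform on (0, 1]
        k = int(u ** (-1.0 / (gamma - 1.0)))
        if k <= k_max:  # rejection step implementing the cap
            freqs.append(k)
    return freqs


def sampled_book(freqs, rng):
    """Word i occurs freqs[i] times; all tokens are placed in random order."""
    tokens = [word_id for word_id, k in enumerate(freqs) for _ in range(k)]
    rng.shuffle(tokens)
    return tokens


def local_slope(tokens, m_low, m_high):
    """Effective Heaps exponent between the book sizes m_low and m_high."""
    n_low = len(set(tokens[:m_low]))
    n_high = len(set(tokens[:m_high]))
    return math.log(n_high / n_low) / math.log(m_high / m_low)


if __name__ == "__main__":
    rng = random.Random(0)
    freqs = sample_frequencies(5000, gamma=1.86, k_max=20_000, rng=rng)
    book = sampled_book(freqs, rng)
    M = len(book)
    print("M =", M, "N =", len(freqs))
    print("slope for small parts:", round(local_slope(book, 100, 1000), 2))
    print("slope near full size: ", round(local_slope(book, M // 2, M), 2))
```

A pure Heaps' law would give the same slope, $\gamma - 1 = 0.86$, in both ranges. In this construction the effective exponent is instead noticeably larger for small parts of the book than near its full size, which is the bending seen in Fig. 3a.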

The core of this paradoxical behavior lies in the fact that the derived form of the
$N(M)$-curve requires a size-independent wfd, and that a random book is always subject to
well-defined statistical properties. One of these properties is that $P_M(k)$ transforms
according to the RBT (random book transformation) when the book is divided into parts [2][5]:
the probability for a word that appears $k'$ times in the full book of size $M$ to appear $k$
FIG. 3: Results for a "sampled book" of length $M$ described by a smooth power-law wfd
$P(k) \propto k^{-\gamma}$. (a) The full curve is the real $N(M)$, whereas the broken straight line
is the Heaps' power-law prediction from Eq. (5). Since the real $N(M)$-curve is bent, it is
clear that a power-law wfd does not give a power-law $N(M)$. (b) illustrates that the wfd
obtained for a part of the full book containing $M'$ words, where $r = M/M'$, has a different
functional form than $P_M$. The curves show the cumulative distributions
$F(k) = \sum_{k' \ge k} P(k')$ for the full random book, $M = 10^6$, and for $M' = 5000$,
respectively.



times in a smaller section of size $M'$ can be expressed in binomial coefficients. Let
$P_M(k')$ and $P_{M'}(k)$ be two column matrices with elements enumerated by $k'$ and $k$; then

$$P_{M'}(k) = C \sum_{k'=k}^{M} A_{kk'} P_M(k') \qquad (6)$$



where $A_{kk'}$ is the triangular matrix with the elements

$$A_{kk'} = (r-1)^{k'-k}\,\frac{1}{r^{k'}}\binom{k'}{k} \qquad (7)$$
and $r = M/M'$ is the ratio of the book sizes. The normalization factor $C$ is

$$C = \frac{1}{1 - \sum_{k'=1}^{M} \left(\frac{r-1}{r}\right)^{k'} P_M(k')}. \qquad (8)$$



Suppose that $P_M(k)$ is a power law with an index $\gamma$. The requirement for the
corresponding random book to obey Heaps' law is then that $P_M(k)$ under the RBT remains a power
law with the same index $\gamma$. However, the RBT does not leave invariant a power law with an
index $\gamma > 1$ [3]. This fact is illustrated in Fig. 3b, which shows that a power law $P_M(k)$
changes its functional form when describing a smaller part of the book. This change of the
functional form is the reason why the $N(M)$-curve in Fig. 3a does not obey Heaps' law. The
implication is that a random book which is well described by the continuum approximation
$p(k) \propto 1/k^{\gamma}$ can never have an $N(M)$-curve of the Heaps' law form
$N(M) \propto M^{\alpha}$.
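The random book transformation is straightforward to evaluate numerically. The sketch below is ours and rests on one assumption about the form of the transformation: each of the $k'$ occurrences of a word lands in the smaller part independently with probability $1/r$, so the matrix elements are binomial probabilities, and the result is renormalized over $k \ge 1$ because words that drop out of the part are not counted. The function name `rbt` is hypothetical.

```python
from math import comb


def rbt(p_full, r):
    """Random book transformation of a wfd.

    p_full[k - 1] is P_M(k) for k = 1..K, and r = M / M' >= 1 is the size ratio.
    Returns P_{M'}(k) for k = 1..K, renormalized over k >= 1.
    """
    keep = 1.0 / r  # chance that a single occurrence lands in the smaller part
    K = len(p_full)
    p_part = []
    for k in range(1, K + 1):
        total = 0.0
        for k_full in range(k, K + 1):
            weight = comb(k_full, k) * keep**k * (1.0 - keep) ** (k_full - k)
            total += weight * p_full[k_full - 1]
        p_part.append(total)
    norm = sum(p_part)
    return [x / norm for x in p_part]


if __name__ == "__main__":
    gamma, K = 1.86, 200  # truncated power law used only as an illustration
    raw = [k**-gamma for k in range(1, K + 1)]
    p_full = [x / sum(raw) for x in raw]
    p_part = rbt(p_full, r=10.0)
    print("P(2)/P(1) in the full book:", round(p_full[1] / p_full[0], 3))
    print("P(2)/P(1) in a tenth of it:", round(p_part[1] / p_part[0], 3))
```

If the power law were invariant under the transformation, the ratio $P(2)/P(1)$ would be the same before and after; comparing the two printed values gives a quick check of whether the functional form has changed.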

In Fig. 4a-b we compare the result for a power law $P_M(k)$ in Fig. 3a-b to the real book Moby
Dick by Herman Melville. Fig. 4a shows the $N(M)$ for $M \approx 212000$ both for the real book
and for the randomized version (where the words in the real book are randomly redistributed
throughout the book) [3]. As seen, the $N(M)$-curves for the real and randomized book are nearly
identical and very reminiscent of the pure power-law case in Fig. 3a: a real and a random book,
as well as a power-law book, deviate from Heaps' law in the same way. In Fig. 4b we show that
the reason is the same: the form of the wfd changes with the size of the book in similar ways.
The result for the real book is not a property solely found in Moby Dick, but has previously
been shown to be a ubiquitous feature of novels [2].

To sum up: a simple mathematical derivation tells us that if the wfd is well described by a
power law, then so is the $N(M)$-curve. This power-law form $N(M) \propto M^{\alpha}$ is called
Heaps' law. However, a sampled book, as well as real books, does not follow Heaps' law, in spite
of the fact that their wfds are well described by smooth power laws. In contrast, the monkey
book, which has a spiky, disjunct wfd, does obey Heaps' law very well.
FIG. 4: Comparison with a real book. (a) $N(M)$-curves for Moby Dick (dark curve) and for the
randomized Moby Dick (light curve) together with a power law (straight broken line). Real and
random Moby Dick have to excellent approximation the same $N(M)$, and this $N(M)$-curve is not a
power law. Note the striking similarity with Fig. 3a. (b) Change in the cumulative distribution
$F(k)$ with text length for Moby Dick; the dark curve corresponds to the full length
$M_{\rm tot} \approx 212000$ words and the light curve to $M' \approx 2000$ ($r = M/M' = 100$). The
change in the functional form of the wfd is very similar to that of the power-law book shown in
Fig. 3b.



E. Conclusions 

We have shown that the $N(M)$-curve for a monkey book obeys Heaps' power-law form
$N(M) \propto M^{\alpha}$ very precisely. This is in contrast to real and randomized real books, as
well as sampled books with word-frequency distributions (wfd) which are well described by smooth
power laws: all of these have $N(M)$-curves which deviate from Heaps' law in similar ways. In
addition we discussed the incompatibility of simultaneous power-law forms of the wfd and the
$N(M)$-curves (Heaps' law). This led to the somewhat counter-intuitive conclusion that Heaps'
power law requires a wfd which is not a smooth power law! We have argued that the reason for
this inconsistency is that the simple derivation that leads to Heaps' law when starting from a
power-law wfd assumes that the functional form is size independent when sectioning down the book
to smaller sizes. However, it is shown, using the random book transformation (RBT), that this
assumption is in fact not true for real or randomized books, nor for a sampled power-law book.
In contrast, a monkey book, which has a spiked and disjunct wfd, possesses an invariance under
this transformation. It is shown that this invariance is a direct consequence of the
discreteness in the frequencies of words due to the discreteness in the length of the words (see
appendix).
I. APPENDIX: THE INFORMATION COST METHOD

Let us imagine a monkey typing on a keyboard with $A$ letters and a space bar, where the chance
of typing a space is $q_s$ and the chance of typing any given letter is $(1-q_s)/A$. A text
produced by this monkey has a certain information content given by the entropy of the letter
configurations produced by the monkey. These configurations result in a word-frequency
distribution (wfd) $P(k)$, and the corresponding entropy $S = -\sum_k P(k)\ln P(k)$ gives a
measure of the information associated with this frequency distribution. The most likely $P(k)$
corresponds to the maximum of $S$ under the appropriate constraints. This can equivalently be
viewed as the minimum information loss, or cost, in comparison with an unconstrained $P(k)$
[18]. Consequently, the minimum-cost $P(k)$ gives the most likely wfd for a monkey.

Since the wfd in the continuum approximation is different from the real distribution $P(k)$, we
will call the former $p(k)$. Let $k$ be the frequency with which a specific word occurs in a
text and let the corresponding probability distribution be $p(k)dk$. This means that $p(k)dk$ is
the probability that a word belongs to the frequency interval $[k, k+dk]$. The entropy
associated with the probability distribution $p(k)$ is $S = -\sum_k p(k)\ln p(k)$ (where
$\sum_k$ implies an integral whenever the index is a continuous variable). Let $M(l)dl$ be the
number of words in the word-letter length interval $[l, l+dl]$. This means that the number of
words in the frequency interval $[k, k+dk]$ is $M(l)\frac{dl}{dk}dk$ because all words of a given
length $l$ occur with the same frequency. The number of distinct words in the same interval is
$n(k)dk = Np(k)dk$, which means that $M(l)\frac{dl}{dk}/n(k)$ is the degeneracy of a word with
frequency $k$. The information loss due to this degeneracy is
$\ln\left(\frac{M(l)\,dl/dk}{n(k)}\right) = \ln\left(M(l)\frac{dl}{dk}\right) - \ln p(k) + \mathrm{const}$
(in nats). The average information loss is given by

$$I_{\rm cost} = \sum_k p(k)\left[-\ln p(k) + \ln\left(M(l)\,dl/dk\right)\right] \qquad (9)$$

and this is the appropriate information cost associated with the words: the $p(k)$ which
minimizes this cost corresponds to the most likely $p(k)$. The next step is to express $M(l)$
and $dl/dk$ in terms of the two basic probability distributions, $p(k)$ and the probability for
hitting the keys: $M(l)$ is just $M(l) \sim A^{l}$. The frequency $k$ for a word containing $l$
letters is

$$k \sim q_s\left(\frac{1-q_s}{A}\right)^{l}. \qquad (10)$$

Thus $k \sim \exp(al)$ with $a = \ln(1-q_s) - \ln A$ so that $dk/dl = ka$ and, consequently,
$I_{\rm loss} = -\sum p(k)\ln p(k) + \sum p(k)[\ln A^{l} - \ln ka]$. Furthermore,
$\ln(A^{l}/ka) = l\ln A - \ln k - \ln a$ and from Eq. (10) one gets
$l = \ln(k/q_s)/\ln((1-q_s)/A)$, from which it follows that
$\ln(A^{l}/ka) = \left(-1 + \frac{\ln A}{\ln(1-q_s) - \ln A}\right)\ln k + \mathrm{const}$. Thus the
most likely distribution $p(k)$ corresponds to the minimum of the information word cost



$$\sum_k p(k)\ln p(k) + \sum_k p(k)\ln k^{\gamma} \qquad (11)$$

with

$$\gamma = \frac{2\ln A - \ln(1-q_s)}{\ln A - \ln(1-q_s)}. \qquad (12)$$

Variational calculus then gives $\ln(p(k)k^{\gamma}) = \mathrm{const}$ so that

$$p(k) \propto k^{-\gamma}. \qquad (13)$$
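The variational step can be spelled out with a Lagrange multiplier $\lambda$ enforcing the normalization $\sum_k p(k) = 1$. The multiplier is our notation, and this is a sketch of the standard argument rather than the authors' full calculation:

```latex
\frac{\partial}{\partial p(k)}
\left[\sum_{k'} p(k')\ln p(k') + \sum_{k'} p(k')\ln k'^{\gamma}
      + \lambda\Big(\sum_{k'} p(k') - 1\Big)\right]
  = \ln p(k) + 1 + \ln k^{\gamma} + \lambda = 0
\;\Rightarrow\;
\ln\!\big(p(k)\,k^{\gamma}\big) = -(1+\lambda) = \mathrm{const}
\;\Rightarrow\;
p(k) \propto k^{-\gamma}.
```

Because $\sum_k p\ln p$ is convex in $p(k)$, this stationary point is a minimum of the functional, and $\lambda$ is fixed afterwards by the normalization condition.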



Note that the total number of words $M$ only enters this estimate through the normalization
condition. This means that the continuum approximation $p(k) \propto k^{-\gamma}$ for the monkey
book is independent of how many words $M$ it contains. Thus if you start from a monkey book with
$M$ words and randomly pick a fraction of these $M$ words, then this smaller book will also have
a wfd which in the continuum limit follows the same power law. This is a consequence of the fact
that the frequency $k$ for a word of length $l$ is always given by Eq. (10) irrespective of the
book size. It is this specific monkey-book constraint which makes $I_{\rm cost}$ $M$-invariant
and hence forces the continuum $p(k)$ to always follow the same power law. The crucial point to
realize is that the very same constraint forces the real $P(k)$ to have a "peaky" structure. One
should also note that if you started from a book consisting of $M$ words randomly drawn from the
continuum $p(k)$, then a randomly drawn fraction from this book will no longer follow the
original power law.